-
Notifications
You must be signed in to change notification settings - Fork 572
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Poll jobs by comparing both priority and insert_time #344
Conversation
Codecov Report
@@ Coverage Diff @@
## master #344 +/- ##
==========================================
- Coverage 68.37% 68.35% -0.03%
==========================================
Files 17 17
Lines 860 891 +31
Branches 104 112 +8
==========================================
+ Hits 588 609 +21
- Misses 242 250 +8
- Partials 30 32 +2
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #344 +/- ##
==========================================
+ Coverage 68.37% 68.63% +0.26%
==========================================
Files 17 17
Lines 860 915 +55
Branches 104 117 +13
==========================================
+ Hits 588 628 +40
- Misses 242 252 +10
- Partials 30 35 +5
Continue to review full report at Codecov.
|
@Digenis |
Your solution implements the second idea in #187. Your code reveals that the second idea has another drawback. |
The compatibility attribute trick was a nice idea by the way |
But the original code also iterates the queues, Lines 20 to 26 in 4520c45
|
FYI: #198 |
It was stopping on the first non-empty queue. Unless I misunderstand |
poll() ends when returnValue() is called. Can you explain ‘race conditions‘ in details, maybe with an example? |
How about the third commit? |
A multiple queue consumer scenario. |
Also, I just realized that the FIFO principal is still violated. A limit of 2 parallel processes. On hour 13:00 projectA gets 5 jobs scheduled, with queueA ids (in sqlite) 1,2,3,4,5 By hour 13:05, projectA's job with queueA id 1 has finished and 2, 3 are running By hour 13:10, projectA's job with queueA id 2 finishes projectA job with queue id 3, although scheduled at 13:00, Of course this is not exactly the current behaviour of your branch The in-queue id defines a priority too — the insertion order. |
So, how do we solve this while keeping the multiple queue approach? SELECT 'projectA', id FROM queueA
UNION ALL
SELECT 'projectB', id FROM queueB
ORDER BY priority DESC -- then what column?
-- the information about the global insertion order was never stored
LIMIT 1; At this point, unless you have a business-critical non-upgradeable legacy system In the poller, an equivalent of In all queues, a new datatime column, saving the insertion time, A master queue saving a global insertion order CREATE TABLE queue (id INTEGER PRIMARY KEY,
project TEXT NOT NULL,
foreign_id INTEGER NOT NULL); which is close to the unified queue idea, Of all the desperate attempts to keep backwards compatibility, |
So, if cancel.json is called while poll() is iterating all the queues with
|
What are the names of these projects and the priorities of each job? |
Even though Maybe we can/should make it in v1.3.0, which provides backward compatibility. Check out my fourth commit, which saves the highest priority of each project in project_priority_map, |
Changes in the fifth commit:
|
Even if it's backwards compatible with user code, it's not with user data in the queue. If we break compatibility, it better be once and better be worthy. But I also don't want to let users fix it on their own. |
But it only ignores the pending jobs on the first startup,
Simply use the same name ‘contrib’? |
What about inserting a new column into the existing table via |
The 6th commit introduces the ensure_insert_time_column method |
@Digenis |
I'm a bit busy right now I think I'll be able to review it in the next 2 days |
scrapyd/sqlite.py
Outdated
@@ -144,6 +147,24 @@ def clear(self): | |||
self.conn.commit() | |||
self.update_project_priority_map() | |||
|
|||
def ensure_insert_time_column(self): | |||
q = "SELECT insert_time FROM %s LIMIT 1" % self.table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also:
SELECT sql FROM sqlite_master WHERE type='table' AND name='spider_queue';
-- ⇒ CREATE TABLE spider_queue (id integer primary key, priority real key, message blob)
but is it any better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know about 'sqlite_master' before.
The ensure_insert_time_column method is updated in the 7th commit.
@@ -82,7 +83,7 @@ class JsonSqlitePriorityQueue(object): | |||
"""SQLite priority queue. It relies on SQLite concurrency support for | |||
providing atomic inter-process operations. | |||
""" | |||
queue_priority_map = {} | |||
project_priority_map = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is like "master table" solution I was talking about
actually implemented as a singleton to share state between all instances of the queue class
and save us lot of io and cpu cycles,
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fast to find out the queue to pop in poll()
since project_priority_map is a dict like:
{'project1': (0.0, -1561646348), 'project2': (1.0, -1561646349)}
So, there's no need to introduce an actual "master table".
scrapyd/sqlite.py
Outdated
@@ -131,6 +131,12 @@ def clear(self): | |||
self.conn.execute("delete from %s" % self.table) | |||
self.conn.commit() | |||
|
|||
def get_highest_priority(self): | |||
q = "select priority from %s order by priority desc limit 1" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but since additional selects are ran for almost all queue method calls
we are not saving cpu cycles or io
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduce SQLite triggers in the 8th commit,
in which project_priority_map would be updated
whenever an INSERT/UPDATE/DELETE occurs.
This is intended to speed up poll() and avoid race conditions,
rather than to save CPU cycles or io.
I think the cost is acceptable.
Comments are to be red in chronological order (not commit order). I feel reluctant to even make comments. |
Remove unnecessary getattr statement Add queue_priority_map to save highest priority Save the highest priority of each project in project_priority_map project_priority_map stores (priority, -timestamp) as value Add ensure_insert_time_column() Query sqlite_master in ensure_insert_time_column() Introduce create_triggers()
200066b
to
1a0cb2b
Compare
Codecov Report
@@ Coverage Diff @@
## master #344 +/- ##
==========================================
+ Coverage 68.37% 68.98% +0.61%
==========================================
Files 17 17
Lines 860 906 +46
Branches 104 116 +12
==========================================
+ Hits 588 625 +37
- Misses 242 247 +5
- Partials 30 34 +4
Continue to review full report at Codecov.
|
If so, how can we go on discussing?
No, this PR aims to fix #187 in a creative and effective way:
I think we should fix #187 in v1.3 as the 'priority' parameter is exposed in PR #161, |
The queue table name is changed from 'spider_queue' to 'spider_queue_with_triggers'. This PR introduces SQLite triggers which are stored in the database, Renaming the table name ensures upgrade and downgrade work well anytime, |
As Digenis wrote, if we try for backwards compatibility, then "It's dirty fixes all the way from here" This solution is very clever, but probably too clever for a project that only gets maintainer attention every few years. It takes too long to understand how it works and how to modify it – compared to a simpler solution (that breaks compatibility). So, I'll close this PR, though the test updates might be useful in future work. Edit: Unrelated to closure, but I think the last commit breaks backwards compatibility for user data (code starts using a new table without migrating any of the data from the old table). |
This PR use project_priority_map to store (priority, -timestamp) as value,
in order to find out the queue to pop.
Fix #187 (the updated test_poll_next() demonstrates the effect).
Also, provide backward compatibility for custom SqliteSpiderQueue
and JsonSqlitePriorityQueue.